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Abstract: The development of wavelet theory has in recent years spawned 
applications in signal processing, in fast algorithms for integral transforms, 
and in image and function representation methods. This last application has 
stimulated interest in wavelet applications to statistics and to the analysis of 
experimental data, with many successes in the efficient analysis, processing, 
and compression of noisy signals and images. 

This is a selective review article that attempts to synthesize some recent 
work on "nonlinear" wavelet methods in nonparametric curve estimation 
and their role on a variety of applications. After a short introduction to 
wavelet theory, we discuss in detail several wavelet shrinkage and wavelet 
thresholding estimators, scattered in the literature and developed, under 
more or less standard settings, for density estimation from i.i.d. observa- 
tions or to denoise data modeled as observations of a signal with additive 
noise. Most of these methods are fitted into the general concept of reg- 
ularization with appropriately chosen penalty functions. A narrow range 
of applications in major areas of statistics is also discussed such as par- 
tial linear regression models and functional index models. The usefulness 
of all these methods are illustrated by means of simulations and practical 
examples. 
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1. Introduction 

Nonparametric regression has been a fundamental tool in data analysis over 
the past two decades and is still an expanding area of ongoing research. The 
goal is to recover an unknown function, say g, based on sampled data that are 
contaminated with noise. Denoising techiques provide a very effective and simple 
way of finding structure in data sets without the imposition of a parametric 
regression model (as in linear or polynomial regression for example). Only very 
general assumptions about g are made such as that it belongs to a certain class 
of functions. 

During the 1990s, the nonparametric regression literature was arguably dom- 
inated by (nonlinear) wavelet shrinkage and wavelet thresholding estimators. 
These estimators are a new subset of an old class of nonparametric regression 
estimators, namely orthogonal series methods. Moreover, these estimators are 
easily implemen ted through fast algorithms so they are very appea l ing in prac- 
tical situations. iDonoho and Johnstone! ( 1994^ and iDonoho et al. (|l995l) have 



introduced nonlinear wavelet estimators in nonparametric regression through 
thresholding which typically amounts to term-by-term assessment of estimates 
of coefficients in the empirical wavelet expansion of the unknown function. If an 
estimate of a coefficient is sufficiently large in absolute value - that is, if it ex- 
ceeds a predetermined threshold - then the corresponding term in the empirical 
wavelet expansion is retained (or shrunk toward to zero by an amount equal to 
the threshold); otherwise it is omitted. 

Extensive reviews and descriptions of the various classical and Bayesian 
wavelet shrinkage and wavelet thresholding estimators are given in the books 
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bv lOgdenl (|l997l ). IVidakovid (|l999l ) an dlPercival and Waldenl (l2000h . in the pa- 
pers appeared in the edited volume by Miiller and Vidakovid ( 19991) . and in the 
review papers by Antoniacfisl ( 19971 ). Vidakovid* ! 1998b ) and Abramovich et al 



(2000). With the increased applicability of these estimators in nonparametric 
regression, several new wavelet based curve smoothing procedures have been 
proposed in the recent literature, and one of the purposes of this review is to 
present few of them under the general concept of penalized least squares regres- 
sion. It should be noted that most of these methods are usually implemented 
under the assumptions of dyadic sample size, equally spaced and fixed sample 
points, and i.i.d. normal errors. When the data does not meet one or both of 
these requirements, various modifications have been proposed and we will also 
mention few of them in the forthcoming sections. To keep the length of this ar- 
ticle reasonable we have not included in our discussion important developments 
in Bayesian wavelet denoising methods which deserve a survey by their own. 

Finally we provide a brief overview of a spectrum of wavelet applications 
in some real data problems. The reader should be cautioned, however, that 
the wavelet is so a large research area that truly comprehensive surveys are 
almost impossible, and thus, our overview may be a little restricted. At first 
we will be concerned with a semiparametric partially linear regression model 
with unknown regression coefficients and we will present a wavelet thresholding 
based estimation procedure to estimate the components of the partial linear 
model. We will also discuss a wavelet based variant of a popular dimension 
reduction reduction method for functional data analysis that is traditionally 
used in practice, namely MAVE. 

The rest of the paper is organized as follows. Section [2] recalls some known 
results about wavelet series, function spaces and the discrete wavelet transform. 
Section [3] briefly discusses the nonlinear wavelet approach to nonparametric re- 
gression and introduces several recent thresholding techniques that can be for- 
mulated as penalized least squares problems. We also discuss some block-wise 
thresholding procedures that have been shown to enjoy a high level of adaptation 
for wavelet series estimates and end with a short discussion on block threshold- 
ing methods for density estimation. Few wavelet based applications to complex 
problems are given in Section [4l Whenever necessary, the practical performance 
of the methods that are discussed is examined by appropriate simulations. 



2. A short background on wavelets 



In this section we give a brief overview of some relevant material on the wavelet 
series expansion and a fast wavelet transform that we need further. 



2.1. The wavelet series expansion 



The term wavelets is used to refer to a set of orthonormal basis functions gen- 
erated by dilation and translation of a compactly supported scaling junction 
(or father wavelet), 4>, and a mother wavelet, ip, associated with an r-regular 



A. Antoniadis/ Wavelet methods in statistics 



19 



multiresolution analysis of L 2 (R). A variety of different wavelet families now 
exist that combine compact support with variou s degre es of smoothness and 
numbers of vanishing moments (see, IPaubechiesI {l992)), and these arc now 



the most intensively used wavelet families in practical applications in statistics. 
Hence, many types of functions encountered in practice can be sparsely (i.e. 
parsimoniously) and uniquely represented in terms of a wavelet series. Wavelet 
bases are therefore not only useful by virtue of their special structure, but they 
may also be (and have been!) applied in a wide variety of contexts. 

For simplicity in exposition, we shall assume that we are wo rking with peri- 
odized wavelet bases on [0,1] (see, for example. iMallati (|1999n . Section 7.5.1), 
letting 

= X>i*(*-o and = 5>i*(*-o. for *e[o,i], 

zez zez 

where 

4> Jk (t) = v' 2 4>{2n - k), ip jk (t) = y/ 2 ip(2h - k). 

For any j > 0, the collection fc , k = 0, 1, . . . , 2*> - 1; tf ok , j > j > 0, k = 
0, 1, . . . , 2 3 ' — 1} is then an orthonormal basis of L 2 ([0, 1]). The superscript "p" 
will be suppressed from the notation for convenience. 

The idea underlying such an approach is to express any function g <E £ 2 ([0, 1]) 
in the form 

2 J '0-1 oo 2 J '-1 

9(t) = Y] a 0ok (j) 0ok {t) + V V jk ip jk (t), jo > 0, t € [0, 1], 



k=0 j=jo k=0 



where 



and 



l 

a< ok = (g, 4> jok ) = I g(t)ct> jok (t) dt, j >0, k = 0, 1, . . . , 2 J ° - 1 



ii 



/3 3k = (g, il> jk ) = [ g(t)ip jk (t) dt, j > jo > 0, k = 0, 1, . . . , 2 j - 1. 
Jo 

An usual assumption underlying the use of periodic wavelets is that the func- 
tion to be expanded is assumed to be periodic. However, such an assumption 
is not always realistic and periodic wavelets exhibit a poor behaviour near the 
boundaries (they create high amplitude wavelet coefficients in the neighborhood 
of the boundaries when the analysed function is not periodic). However, periodic 
wavelets are commonl y used because th e numerical implementation is partic- 
ular simple. While, as Ijohnstone (1994) has pointed out, this computational 



simplification affects only a fixed number of wavelet coefficients at each resolu- 
tio n level, we will als o present later on an effective method, developed recently 
bv lOh and Leel (120051 ). combining wavelet decompositions with local polynomial 



regression, for correcting the boundary bias introduced by the inappropriateness 
of the periodic assumption. 
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2.2. Function spaces and wavelets 

The (inhomogeneous) Besov spaces on the unit interval, B s pi [0, 1], consist of 
functions that have a specific degree of smoothness in their derivatives. More 
specifically, let the rth difference of a function / be 

and let the rth modulus of smoothness of / G L Pl [0, 1] be 

v r ,pi(f;t) = sup(||A^ r) /|Upi[o,i-r/i])- 

h<t 

Then the Besov seminorm of index (s,pi,p2) is defined for r > s, where 1 < 
pi,p% < oo, by 

1/P2 

, if 1 < p 2 < oo, 



1 f "r, P1 (f;h)\ p2 dh 



and by 



n ' o<h<i L h 
The Besov norm is then defined as 

\L»x+\f\B 



PI ,P2 



and the Besov space on [0, 1], B s pi p2 [0, 1], is the class of functions / : [0, 1] — ► M 
satisfying / E L Pl [0, 1] and |/|b* < oo, i.e. satisfying P2 < oo. The 

parameter pi can be viewed as a degree of function's inhomogeneity while s 
is a measure of its smoothness. Roughly speaking, the (not necessarily integer) 
parameter s indicates the number of function's derivatives, where their existence 
is required in an L Pl -sense; the additional parameter pi is secondary in its role, 
allowing for additional fine tuning of the definition of the space. 

The Besov classes include, in particular, the well-known Hilbert-Sobolev 
(H% [0,1], s = 1,2,...) and Holder (C S [0,1], s > 0) spaces of smooth func- 
tions (B| 2 [0, 1] and 5^,^(0,1] respectively), but in addition less-traditional 
spaces, like the space of bounded- variation, sandwiched between B\ ^0, 1] and 
B\ ^[0, 1]. The latter functions are of statistical i nterest becau se they allow for 
better models of spatial inhomogeneity (see, e.g., Mevei ( 19921 )). 



The Besov norm for the function / is related to a sequence space norm on 
the wavelet coefficients of the function. Confining attention to the resolution 
and spatial indices j > jo and k — 0, 1, . . . , 2° — 1 respectively, and denoting by 
s' = s+ 1/2 — 1/px, the sequence space norm is given by 

1/P2 

1%-^ = IKJU { > ] 2' v '"||/3 ( -ii£ > ■ I : f>2 - x. 
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where 



E 

fc=0 



/'i 



and 1 1/3 • 



2 J -1 

E 

fc=0 



pi 



If the mother wavelet ip is of regularity r > 0, it can be shown that the 
corresponding orthonormal periodic wavelet basis defined in Section 12.11 is an 
unconditional basis for the Besov spaces B s p2 [0, 1] for < s < r, 1 < p%, pi < 
oo. In other words, we have 



< 1101 



where K% and Ki are constants, not depending on /. Therefore the Besov norm 
of the function / is equivalent to the corresponding sequence space norm de- 
fined above; this all ows one to characterize Besov spaces i n term s of wavelet 
coefficients (see, e.g.. lMeven ( 1992t ): lDonoho and Johnstone! ( 1998^1. For a more 
detailed study on (inhomogeneous) Besov spaces we refer to iMeverl (|l992l) . 



2.3. The discrete wavelet transform 



In statistical settings we are more usually concerned with discretely sampled, 
rather than continuous, functions. It is then the wavelet analogy to the dis- 
crete Fourier transform which is of primary interest and this is referred to 
as the discrete wavelet transform (DWT). Given a vector of function values 
g = (g(ti), ...,g(t n )Y at equally spaced points ti, the discrete wavelet transform 
of g is given by 

d = W&, 

where d is an n x 1 vector comprising both discrete scaling coefficients, Cj k, and 
discrete wavelet coefficients, djk, and W is an orthogonal nxn matrix associated 
with the orthonormal wavelet basis chosen. The Cj k and djk are related to their 
continuous counterparts ctj k and (3jk (with an approximation error of order 
n^ 1 ) via the relationships 

Cjofc ~ Vn a jo k and d jk « \fn f3 jk - 

The factor ^fn arises because of the difference between the continuous and dis- 
crete orthonormality conditions. This root factor is unfortunate but both the 
definition of the DWT and the wavelet coefficients are now fixed by convention, 
hence the different notation used to distinguish between the discrete wavelet co- 
efficients and their continuous counterpart. Note that, because of orthogonality 
of W, the inverse DWT (IDWT) is simply given by 

g = W T d, 



where W T denotes the transpose of W. 

If n = 2 J for some positive integer J, the DWT and IDW T may be per - 
formed through a computationally fast algorithm developed by iMallatl (|l999h 
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that requires only order n operations. In this case, for a given jo & n d under pe- 
riodic boundary conditions, the DWT of g results in an n-dimensional vector d 
comprising both discrete scaling coefficients Cj k, k = 0, 2 J0 — 1 and discrete 
wavelet coefficients djk, j = jo> •••> J ~ lj & = 0, 2 J — 1. 

We do not provide technical details here of the order n DWT algorithm 
mentioned above. Essentially the algorithm is a fast hierarchical scheme for 
deriving the required inner products which at each step involves the action of low 
and high pass filters, followed by a decimation (selection of every even member of 
a sequence). The IDWT may be similarly obtained in terms of related filtering 
operations. For excel lent accounts of the DWT a nd IDWT in terms o f filte r 
op erators we refer to [ Nason and Silverman! (| 19951 ) , IStrang and Nguvenl (1996), 



or Burrus et al. (199 



3. Denoising by wavelet thresholding 

The problem of estimating a signal that is corrupted by additive noise is a 
standard problem in statistics and signal processing. It can be described as 
follows. Consider the standard univariate nonparametric regression setting 

Vi = g(U) + cr £i) i=l,...,n, (3.1) 

where are independent N(0, 1) random variables and the noise level a may be 
known or unknown. We suppose, without loss of generality, that U are within 
the unit interval [0, 1]. The goal is to recover the underlying function g from 
the noisy data, y = (j/i, . . . ,y n ) T , without assuming any particular parametric 
structure for g. 

One of the basic approaches to handle this regression problem is to consider 
the unknown function g expanded as a generalised Fourier series and then to 
estimate the generalised Fourier coefficients from the data. The original (non- 
parametric) problem is thus transformed to a parametric one, although the 
potential number of parameters is infinite. An appropriate choice of basis for 
the expansion is therefore a key point in relation to the efficiency of such an 
approach. A 'good' basis should be parsimonious in the sense that a large set of 
possible response functions can be approximated well by only few terms of the 
generalized Fourier expansion employed. Wavelet series allow a parsimonious 
expansion for a wide variety of functions, including inhomogeneous cases. It is 
therefore natural to consider applying the generalized Fourier series approach 
using a wavelet series. 

In what follows we assume that the sample points are equally spaced, i.e. 
ti = i/n, and that the sample size n is a power of two: n = 2 J for some positive 
integer J. These assumptions allow us to perform both the DWT and the IWDT 
using Mallat's fast algorithm. Note, however, that for non-equispaced or random 
designs, or sample sizes which are not a power of two, or data contaminated 
with correlated noise, modifications are needed to the standard wavelet-based 
estimation procedures that will be discussed in subsection 13.41 
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Due to the orthogonality of the matrix W, the DWT of white noise is also 
an array of independent N(0, 1) random variables, so from (|3 . 1 [) it follows that 



djk 



c j k +o- e jk , 
djk + & tjki 



= 0,1,. .-,2^-1, 
jo,..., J -I, k 



0. 



,2» 



1. 



(3.2) 
(3.3) 



where Cj k and djk are respectively the empirical scaling and the empirical 
wavelet coefficients of the the noisy data y, and ejk are independent N(0, 1) 
random variables. 

The sparseness of the wavelet expansion makes it reasonable to assume that 
essentially only a few 'large' djk contain information about the underlying func- 
tion g, while 'small' djk can be attributed to the noise which uniformly contam- 
inates all wavelet coefficients. Thus, simple denoising algorithms that use the 
wavelet transform consist of three steps: 

1) Calculate the wavelet transform of the noisy signal. 

2) Modify the noisy wavelet coefficients according to some rule. 

3) Compute the inverse transform using the modified coefficients. 

Traditionally, for the second step of the above approach there are two kinds of de- 
noising methods; namely, linear and nonlinear techniques. A wavelet based linear 
ap proach, extend ing simply spline smoot hing estimation m ethods as described 
bv Wahbal (Il990h. is the on e suggested bv I Antoniadis! |l996) and independently 
bv lAmato and Vuzal (| 19971 ) - This method is appropriate for estimating relatively 
regular functions. Assuming that the smoothness index s of the function g to 
be recovered is known, the resulting estimator is obtained by estimating the 
scaling coefficients Cj k by their empirical counterparts £j k and by estimating 
the wavelet coefficients djk via a linear shrinkage 



djk = 



djk 



1 + A2 2 J' S 



where A > is a smoothing param eter. The parameter A i s chosen by cross- 
validation in Amato and Vuza ( 1997 ) , while the choice of A in Antoniadia ( 19961 ) 
is based on risk minimization and depends on a preliminary consistent estima- 
tor of the noise level a. While simple and cheap to implement, the above linear 
method is not designed to handle spatially inhomogeneous functions with low 
regularity. For such functions one usually relics upon nonlinear thresholding or 
shrinkage methods . One of the e arlies t papers in the field of wavelet denois- 
ing may be that of IWeaver et al. (|l99lh . proposing a hard -thresholding s cheme 
for filtering noise from magnetic resonance images. While Weaver et al.l ( 1991 ) 
demonstrated the advantages of the wavelet thresholding scheme mainly based 
on experimental results, the first thorough mathematical treatment of wavelet 
shrinkage and wavelet thresholding was done by Don oho et al. in a series of 
techni c al reports in t h e early 1990's and p ubli shed in Donoho and Johnstone 
(|l994l ). lDonohol (|l995l ). |Ponoho et all (|l995l ) and lDonoho and Johnstone! (|l998h . 
Donoho and his coworkers analyzed wavelet thresholding and shrinkage meth- 
ods in the context of minimax estimation and showed, that wavelet shrinkage 
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generates asymptotically optimal estimates for noisy data that outperform any 
linear estimator. 

Mathematically wavelet coefficients are estimated using either the hard or 
soft thresholding rule given respectively by 

jtH/j x f if \djk\ < A , . 

s \(djk) = < s ., Ij ! . (3.4) 

L a jk if \djk\ > A 

and 

r o if \d jk \ < a 

S\(d jk ) = I d jk -X if d Jk > A (3.5) 
[ d jk + A if ijTc < -A. 

Thresholding allows the data itself to decide which wavelet coefficients are signif- 
icant; hard thresholding (a discontinuous function) is a 'keep' or 'kill' rule, while 
soft thresholding (a continuous function) is a 'shrink' or 'kill' rule. Beside these 
two possibilities there are many others (semi-soft shrinkage, firm shrinkage, . . . ) 
and as long as the shrinkage function preserves the sign (sign(<5A(x)) = sign(ir)) 
an d shrinks the m a gnitud e (\5\ (x)\ < \x\) one can ex pect a denoising effect. 
Gao and Bruce ( 1997 ) and iMarron et al. ( 1998h have shown that simple 



threshold values with hard thresholding results in larger variance in the func- 
tion estimate, while the same threshold values with soft thresholding shift the 
estimated coefficients by an amount of A even when \dj k \ stand way out of noise 
level, creating unnecessary bias when the true coefficients are large. Also, due to 
its discontinuity, hard thresholding can be unstable - that is, sensitive to small 
changes in the data. 

To remedy the d rawbacks of both hard and soft thresholding rules, 



Gao and Brucd (|1997h considered the firm threshold thresholding 



if fe|<Ai 

CA) = { signK-fc) ^^ if \i<\d jk \<\ 2 (3.6) 
dj k if \djk\ > A 2 

which is a "shrink" or "kill" rule (a continuous function). 

The resulting wavelet thresholding estimators offer, in small samples, advan- 
tages over both hard thresholding (generally smaller mean squared error and less 
sensitivity to small perturbations in the data) and soft thresholding (generally 
smaller bias and overall mean squared error) rules. For values of \dj k \ near the 
lower threshold Ai, 5^ \ 2 {djk) behaves like Sf i (dj k ). For values of \dj k \ above 

the upper threshold A 2 , 5^ \ 2 (djk) behaves like S^ 2 (dj k )- Note that the hard 
thresholding and soft thresholding rules are limiting cases of (13. 6|) with Ai = A 2 
and A 2 = oo respectively. 

Note that firm thresholding has a drawback in that it requires two threshold 
values (one for 'keep' or 'shrink' and another for 'shrink' or 'kill'), thus making 
the estimation procedure for the threshold v alues more computationally expen- 
sive. To overcome this drawback, Gao (|l998f) considered the nonnegative garrote 
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thresholding 



(djk) 




\dj k \ 
\d jk \ 



< A 
> A 



(3.7) 



which also is a "shrink" or "kill" rule (a continuous function). The resulting 
wavelet thresholding estimators offer, in small samples, advantages over both 
hard thresholding and soft thresholding rules that is comparable to the firm 
thresholding rule, while the l atter requir e s two threshold value s. 

In the same spirit to that in Gaol ( 1998h . Antoniadis and Fan ( 2001 ) suggested 
the SCAD thresholding rule 



rSCAD 



(djk) 



sign(d jk ) max (0, \d jk \ - A) if \d jk \ < 2A 

(a-lK- fc -aA S ign(rf jfc ) if 2A < |d ifc | < aA 

if \d jk \ > aX 



(3.. 



which is a "shrink" or "kill" rule (a piecewise linear function). It does not over 
penalize large values of \dj k \ and hence does not create e xcessive bias when the 
wavelet coefficients are large. lAntoniadis and Fan! (|200lh . based on a Bayesian 
argument, have recommended to use the value of a = 3.7. 

The interesting thing about wavelet shrinkage is, that it can be interpreted 
in very different ways, less known in the statistics literature, allowing a better 
understanding of wavelet shrinkage and providing an unified framework for many 
seemingly different thresholding rules in nonparametric function estimation. 



3.1. Wavelet shrinkage and nonlinear diffusion 

Nonlinear diffusion filtering and wavelet shrinkage are two methods that serve 
the same purpose, namely discontinuity-preserving denoising. In this subsec- 
tion we give a survey on relations between both paradigms when fully discrete 
versions of nonlinear diffusion filters with an e xplicit time di s cretiz ation are con- 
sidered. In particular we present the results of iMrazek et~aH §003) connecting a 



shift-invariant Haar wavelet shrinkage and the diffusivity of a nonlinear diffusion 
filter. This allows to present corresponding diffusivity functions to some known 
and widely used shrinkage functions or new shrinkage functions with competi- 
tive performance which are induced by famous diffusivities. Due to the lack of 
space we can only p resent the main ideas and refer the reader to the paper of 
Mrazek et al. (2003) for more details. Before proceeding, we would like first to 



recall some facts about translation-invariant denoising. 

One drawback of the discrete wavelet transform is that the coefficients of the 
discretized signal are not circularly shift equivariant, so that circularly shifting 
the observed series by some amount will not circularly shift the discrete wavelet 
transform coefficients by the same amount. This seriousl y degrades the quality 



of the denoising achieved. To try to alleviate this problem I Coifman and Donoho 



(| 19951 ) introduced the technique of 'cycle spinning'. The idea of denoising via 
cycle spinning is to apply denoising not only to y, but also to all possible unique 
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circularly shifted ver s ions o f y, and to average the results. As pointed out by 
Percival and Waldenl ( 2000h (see, p. 429) another way to perform the translation 



invariant shrinkage is by applying standard thresholding to the wavelet coeffi- 
cients of the maximal overlap discrete wavelet transform, a transform we more 
briefly refer to as the stationary wavelet transform and we refer to the above 
monograph for details. We are now able to introduce and analyze a general con- 
nection between translation invariant Haar wavelet shrinkage and a discretized 
version of a nonlinear diffusion. 

The scaling and wavelet filters h and h corresponding to the Haar wavelet 
transform are 

h = -^=(...,0,1,1,0,...) h= -^=(...,0,-1, 1,0,... ). 

Given a discrete signal / = (fk)kez, it is easy to see that a shift-invariant soft 
wavelet shrinkage of / on a single level decomposition with the Haar wavelet 
creates a filtered signal u — (uk)kez given by 

l/j. , of , f \ i 1 ( rS ( fk+l~ fk\ . X S ( fk — fk-1 

u k = t(A-i + 2/fc + jfc+i) + — = -6% = + Si 1 



2v / 2V V y/2 ) V y/2 

where Sf denotes the soft shrinkage operator with threshold A. Because the filters 
of the Haar wavelet are simple difference filters (a finite difference approximation 
of derivatives) the above shrinkage rule looks a little like a discretized version 
of a differential equation. Indeed, rewriting the above equation as 

r , fk+1 — fk fk — Jk-1 
Uk = Ik + 



4 4 

1 ( ( fk+1 ~ fk \ _|_ ( fk — fk-1 



fk + 



2V2 V V y/2 J V y/2 

(fk+l - fk) 1 jS/ fk+1 - fk 



4 2V2 V V2 

{fk - fk-l) f ( fk - fk-1 

A 



we obtain 

Uk - fk 



2V2 \ \/2 



(fk+i - fk)g(\fk+i - fk\) - (fk - fk-i)g(\fk - fk-i\), (3.9) 



At 

with a function g and a time step size At defined by 

""W>-i-iv!ff*($ 

Eq. (|3.9p appears as a first iteration of an explicit (Euler forward) scheme for a 
nonlinear diffusion filter with initial state /, time step size At and spatial step 
size 1, and therefore the shrinkage rule corresponds to a discretization of the 
following differential equation 

d t u = d x ((d x u)g(\d x u\)), (3.10) 
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with initial condition u(0) = f. This equation is a 1-D variant of the Perona- 
Malik diffusion equation well known in image processing, and the function g is 
called the dif fusivity. The basic idea be hind nonlinear diffusion filtering in the 
1-D case (see lDroske and Rumpfl <|2004) ) is to obtain a family u(x, t) of filtered 



versions of a continuous signal / as the solution of the diffusion process stated 
in Eq. (I3.10p with f as initial condition, u{x, 0) = ,f(x) and reflecting boundary 
conditions. The diffusivity g controls the speed of diffusion depending on the 
magnitude of the gradient. Usually, g is chosen such that it is equal to one for 
small magnitudes of the gradient and goes down to zero for large gradients. 
Hence the diffusion stops at positions where the gradient is large. These areas 
are considered as singularities of the signal. Since the Perona-Malik equation is 
nonlinear the existence of a solution is not obvious. 

We can now give a proposition which relates some properties of shrinkage 
functions and diffusivities and whose proof is an easy consequence of the re- 
lation between g and S\. A detailed analysis of this connection in terms of 
ex tremum principles, m onotonic ity pres e rvatio n and sign stability can be found 
in lMrazek et alJ ( 20031 ) (see also Lorenz (2006)). We formulate this relations for 



the case At = 1/4 which is a common choice and widely used for the Perona- 
Malik equation. This choice makes the relations more clear and relates some nice 
properties of the shrinkage function to other nice properties of the diffusivity. 

Propostion 3.1. Let At = 1/4. Then the diffusivity and the shrinkage function 
are related through 



V2, 



3(N)=1 -wHtW- (3 - n) 

The following properties hold: 

1. If 5\ performs shrinkage then the diffusion is always forward, i. e. 

Sx(\x\)<\x\^g(x)>0. 

2. If 5\ is differentiable at then, as x — > 0, 

g{x) -> 1 & 5 X (0) = and S' x (0) = 0. 

3. If the diffusion stops for large gradients the shrinkage function has linear 
growth at infinity, i. e. 

I \ n 5\(x) 

g[x) — > 0, as x — > oo — ■ ► 1, as x — > oo. 

Some examples will make the correspondence more clear. As suggested by 
Proposition 13.11 we choose At — 1/4 and derive the corresponding diffusivities 
by plug in the specific shrinkage function into (|3.11[) . 

Linear shrinkage A linear shrinkage rule, producing linear wavelet denoising 
is given by 5\(x) = jtj- The corresponding diffusivity is constant </(|x|) = 
tj^x) ' anc ^ ^ ne diffusion is linear. 
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Soft shrinkage The soft shrinkage function 8\{x) = sign(x)(|x| — A)+ gives 



9(\x\) = Q 



, which is a stabilized total variation diffusivity 



fsee lsteidl and Weickertf (|2002l )). 
Hard shrinkage The hard shrinkage function 6\(x) — x(l — In x \<\\(x)) leads 
to <7(|x|) = ^/u^^/Jxid^l) w hi cn is a piecewise linear diffusion that degen- 
erates for large gradients. 
Garrote shrinkage The nonnegative garrote shrinkage 5\(x) = (x — ^) (1 — 
I{\ x \<\}( x )) leads to a stabilized unbounded BFB diffusivity given by 

ff(M) = ■ z {[»i<^A>(l ar l) + t£ j {|x|>v5a}(M)- 

Firm shrinkage Firm shrinkage defined by eq. p.6p yields a diffusivity that 
degenerates to for sufficiently large gradients: 

if |x| < 
I I if v^Ai < |x| < V2A 2 . 
if Id > V2X 2 



9(\x\) 



( V2X 2 



(A2-A1) V |x| 





SCAD shrinkage SCAD shrinkage defined by eq. 



gives also a diffusivity 



that degenerates to 0: 



5(M) 




1 

0-2 



if |x| < y/2X 

if V2A < |x| < 2V2A 

if 2V2A < |x| < aV2A ' 

if Ixl > aV2A 




01234 01234 01234 01234 0123 
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12 3 



12 3 



12 3 



12 3 



12 3 



12 3 



FlG 1. Shrinkage functions (top) and corresponding diffusivities (bottom). The functions are 
plotted for \ = 1 ; Ai = 1, A2 = 2 (Firm) and a = 3.7 (Scad). The dashed line represents the 
diagonal. 



All these example are depicted in Figure[T] The other way round one can ask, 
how the shrinkage functions for famous diffusivities look like. The function 5\ 
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expressed in terms of g looks like <5a(M) = M(l — <?(\/2M) and the dependence 
of the shrinkage function on the threshold parameter A is naturally fulfilled be- 
cause usually diffusivities involve a parameter too. Using this remark we obtain 
the following new shrinkage functions: 



Charbonnier diffusivity The Charbonnier diffusivity (jCharbonnier et al 



(|l994h ) is given by g(\x\) 
age function 6\(x) = x [ 1 



1 



-1/2 



and corresponds to the shrink- 



A 2 



X 2 +2x 2 



Perona-Malik diffusivity The Perona-Malik diffusivity (jPerona and M alik 
((1990D) is defined by g(\x\) = (l + f^) and lead to the shrinkage func- 



tion S\(x) = 



2x* 



52^ 



Weickert diffusivity IWeickertl f|l998h introduced the following diffusivity 



g(\x\) = I{\ x \ >0 (x) (l - exp (~ (% 3 |/l 8 )t )) wn i cn leads to the shrinkage 

0.20718A 8 " 



function 5\(x) = a; exp ( — 



Tukey diffusivity Tukey's diffusivity, defined bv lBlack et al. ( 1998f ) as g( 

(1 — (x/A) 2 ) 2 /{| :E |< A (|a;|) leads to the shrinkage function 



Sx(x) 



TT-T^ if \x\<X/V2 
x if jacrj > A/V5 






Fig 2. "Classical" diffusivites (top) and corresponding shrinkage functions. 



Figure [2] illustrates these diffusivities and their corresponding shrinkage func- 
tions. Having developed the connection between diffusivities and shrinkage func- 
tions, we will further exploit them in the subsection that follows in order to show 
that they can all be interpreted as cases of a broad class of p enalized least squares 
estimators. This unified treatment and the general results of lAntoniadis and Far] 
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(2001) on penalized wavelet estimators allow a systematic derivation of oracle 
inequalities and minimax properties for a large class of wavelet estimators. 



3.2. Penalized least-squares wavelet estimators 

The various thresholding rules presented in the previous subsection play no 
monopoly in choosing an ideal wavelet sub-basis to efficiently estimate an un- 
known function observed with noise. It turns out that the corresponding nonlin- 
ear estimators can be seen as optimal estimators in a regularization setting for 
specific penalty functions, i.e. most of the shrinkage rules can be linked to a reg- 
ularization process under a corresponding and reasonable penalty function. Ex- 



plorin g the nature of these penalties and using results from I Antoniadis and Fan 



( 20011 ). it is then easy to show that the corresponding thresholding estimators 
have good sampling properties and are adaptively minimax. It is the aim of this 
subsection to integrate the diverse shrinkage rules from a regularization point 
of view. 

When estimating a signal that is corrupted by additive noise by wavelet based 
methods, the traditional regularization problem can be formulated in the wavelet 
domain by finding the minimum in of the penalized least-squares functional 
t{0) defined by 

e{0) = \\Wy-0\\i + 2\Y J P{m), (3-12) 

i>i 

where 9 is the vector of the wavelet coefficients of the unknown regression func- 
tion g andp is a given penalty function, while io is a given integer corresponding 
to penalizing wavelet coefficients above certain resolution level jo. Here to facili- 
tate the presentation we changed the notation d^k from a double array sequence 
into a single array sequence 6i. Similarly, denote by z the vector of empirical 
wavelet coefficients Wy. With a choice of an additive penalty J2i>i P(\@i\)> the 
minimization problem becomes separable. Minimizing (|3 . 1 2|) is equivalent to 
minimizing 

i{9i) = \\z i -0 i f + 2Xp(\9 i \), (3.13) 

for each coordinate i. The resulting penalized least-squares estimator is in this 
case separable, that is the estimate of any coordinate 6i depends solely on the 
empirical wavelet coefficient Zi- While separable estimators have their draw- 
backs, this is the case that we address here. To reduce abuse of notation, and 
because p{\9\) is allowed to depend on A, we will use p\ to denote the penatly 
function Xp in the following discussion. 

The performance of the resulting wavelet estimator depends on the penalty 
and the regulariza ti on p arame ter A. To s elect a good penalty function, 
Antoniadis and Far] (|200lh and iFan and Lil (|200lh proposed three principles 



that a good penalty function should satisfy: unbiasedness, in which there is 
no over-penalization of large coefficients to avoid unnecessary modeling biases; 
sparsity, as the resulting penalized least-squares estimators should follow a 
thresholding rule such that insignificant coefficients should be set to zero to 
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reduce model complexity; and continuity to avoid instability and large vari- 
ability in model prediction . The interested reader is referred to Theorem 1 of 
Antoniadis and Far] (|200lh which gives the necessary and sufficient conditions 



on a penalty function for the solution of the penalized least-suares problem to 
be thresholding, continuous and approximately unbiased when \z\ is large. Our 
purpose here is to show how to derive the penalties corresponding to the thresh- 
olding rules defined in the previous subsection, and to show that almost all of 
them satisfy these conditions. More precisely, we have 

Propostion 3.2. Let 5\ : R — > R be a thresholding function that is increasing 
antisymmetric such that < 5\ (x) < x for x > and 5\{x) — ► oo as x — ► oo. 
Then there exist a continuous positive penalty function p\, with p\(x) < p\(y) 
whenever \x\ < \y\, such that 8\{z) is the unique solution of the minimization 
problem ming(2: — 9) 2 + 2p\(\8\) for every z at which 5\ is continuous. 

Proof. Let r\ : R — > R defined by r\(x) = sup{z\5\(z) < x}. The function r\ 
is well defined, since the set upon which the supremum is taken is non empty 
(recall that 5\(z) — > — oo as z — > — oo) and upper bounded (since S\(z) — > oo as 
z -> oo. For 6 > 0, let 



Pa(0) = / (r\(u) - u)du, 
Jo 

and suppose that 5\ is continuous at 9. Let 



k(9) = (9- zf + 2 Px {9) - z l = 2 / (rx(u) - z)du, 

so that 



ii 



k(6) - k(r x {z)) = 2 / {r x {u)-z)du. 

Using the assumptions, it is easy to show that for 9 ^ r\(z), we have k(9) > 
k(r\(z)), so 9 = 6\(z) is the only solution to the minimization problem. The 
contracting property of the shrinkage function leads directly to the contracting 
property of the corresponding penalty. □ 

It is however interesting to recall here from the above proof the almost an- 
alytical expression for p\. Denote by r\ the generalized inverse function of S\ 
defined by r\(x) = sup{z|<5a(z) < x}. Then, for any z > 0, p\ is defined by 

px(z)= f {r x (u)-u)du. (3.14) 
Jo 

Note that all the thresholding function studied in the previous subsection satisfy 
the conditions of Proposition 13.21 Applying expression (|3.14[) one finds, in par- 
ticular, the well known ridge regression L2-penalty p^d^l) = A|6*| 2 correspond- 
ing to the linear shrinkage function, the Li-penalty Pa(|^|) = correspond- 
ing to the soft thresholding rule and the hard thresholding penalty function 
Pa(|0|) = A 2 - - A) 2 /{|ei| < A}(|0|) that results in the hard-thresholding rule 
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Linear shrinkage 



Soft shrinkage 



Hard shrinkage 



Garrote shrinkage 



Firm shrinkage 








-4-2 2 4 -4-2 2 4 -4-2 2 4 -4-2 2 4 -3 -1 123 



Scad shrinkage Charbonnier shrinkage Perona-Malik shrinkage Weickert shrinkage Tukey shrinkage 








Fig 3. Penalties corresponding to the shrinkage and thresholding functions with the same 
name. 



(see Antoniadisl ( 1997t )). Figure [3J displays all penalty functions corresponding 
to the various shrinkage functions studied in this review. 

Among the penalties displayed in Figure [31 the quadratic penalty, while con- 
tinuous is not singular at zero, and the resulting estimator is not thresholded. 
All other penalties are singular at zero, thus resulting in thresholding rules that 
enforce sparseness of the solution. The estimated wavelet coefficients obtained 
using the hard-thresholding penalty is not continuous at the threshold, so it 
may induce the oscillation of the reconstructed signal (lack of stability). In 
the soft-thresholding case, the resulting estimator of large coefficients is shifted 
by an amount of A, which creates unnecessary bias when the coefficients are 
large. The same is true for Charbonnier and Perona-Malick penalties. All other 
penalties have similar behaviour. They are singular at zero (thus encouraging 
sparse solutions) , continuous (thus stable) and do not create excessive bias when 
the wavelet coefficients are large. Most importantl y, all the above thresholdin g 
penalties satisfy the conditions of Theorem 1 in lAntoniadis and Fan! (|200lh . 
The implication of this fact is that it leads to a systematic derivation of ora- 
cle inequalitie s and minimax properties f or the resulting wavelet estimators via 
Theorem 2 of lAntoniadis and Fan! ( 200lh . In pa rticular, the optimal ha r d and 
soft universal threshold A = log 2 n given bv lDonoho and Johnstone! ( 1994 ) 
leads to a sharp asymptotic risk upper bound and the resulting penalized esti- 
mators are adaptively minimax within a factor of logarithmic order over a wide 
range of Besov spaces. 
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3.3. Numerical examples 

In this subsection, we illustrate the performance of the penalized least-squares 
methods introduced in the previous subsections by using some simulated data 
sets. For simulated data, we use the functions "heavisine", "blip", "corner" and 
"wave" as testing functions. The first three contain either jump points or cusps, 
while the fourth one is regular enough to see how the methods perform for 
regular functions. These, together with a typical Gaussian noisy data with a 
signal-to-noise ratio (SNR) of 3, are displayed in Figure [U 



Heavtelne(SNR=3) Corner (SHR=3) 




Fig 4. The four signals used in the simulations (solid lines) together with a typical noisy 
version (points). 

For each signal, we have used two noise levels corresponding to signal-to-noise 
ratios 3 and 7 (low and high). For each simulation and each function, we have 
used an equispaced design of size 512 within the interval [0, 1] and a Gaussian 
noise was added to obtain the observed data. The experiment was repeated 100 
times. Figure O dipslays a typical fit of the Heavisine function from corrupted 
data (SNR=3) and Table [1] reports the average mean-squared error over all 
Monte Carlo experiments for each method on each signal and signal-to-noise 
ratio combination. 

As one can see most methods perform similarly when an universal threshold is 
used. Note also the bias of the soft-thresholding rule and the similarity between 
the firm and the scad shrinkage. From Table [TJ we can see that the less classical 
nonlinear regularization methods are often superior to the hard-thresholding 
and the soft-thresholding method and always better than ridge regression. 

3-4- Notes and remarks 

In this whole section, we have tried to present a study for different denoising 
methods for signals observed on regularly spaced design and corrupted by ad- 
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Fig 5. A typical fit of the penalized least sqaures methods for the Blip function corrupted with 
an additive Gaussian noise at a signal-to-noise ratio of 3. 



Table 1 

Mean Squared Errors over 100 Simulations for each signal and SNR setting. 



Method 




Low SNR 








High SNR 








Hcavi 


Blip 


Corner 


Wave 


Hcavi 


Blip 


Corner 


Wave 


Weick 


80 


95 


63 


212 


46 


64 


27 


147 


Hard 


65 


76 


60 


152 


48 


57 


30 


147 


Soft 


113 


181 


67 


472 


46 


108 


26 


2 49 


Garrotc 


87 


103 


63 


256 


45 


71 


26 


160 


Firm 


70 


78 




185 


43 


58 


27 


133 


Lin 


168 


1849 


69 


4344 


45 


354 


27 


S12 


Perona 


105 


149 


60 


392 


43 


102 


27 


234 


Char 


136 


354. 


656 


9056 


446 


176 


26 


420 


Tukey 


84 


90 


57 


247 


42 


65 


28 


129 


Scad 


82 


90 


58 


241 


43 


65 


27 


140 



ditive Gaussian noise and have shown, that many of them are leading to the 
idea of shrinkage in a general sense. However, as stated already in the introduc- 
tion, when the data does not meet these requirements (equally spaced and fixed 
sample points, periodic boundary conditions and i.i.d. normal errors) various 
modifications have been proposed in the recent literature and we would like to 
shortly mention few of them in this subsection. 

An usual assumption underlying the use of wavelet shrinkage, that we have 
also adopted in the previous sections of this review, is that the regression func- 
tion is assumed to be periodic, in order to use a wavelet transform with peri- 
odic boundary con ditions. However, such an assumption is not always realistic. 
Oh and Leel (|2005h propose an effective hybrid automatic method for correcting 



the boundary bias in wavelet shrinkage introduced by the inappropriateness of 
such a periodic assumption. Their idea is to combine wavelet shrinkage with 
local polynomial regression, where the latter regression technique is known to 
possess excellent boundary properties. Results from their numerical experiments 
strongly support this approach. The interested reader is referred to their paper 
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for further details. Note that it is quite easy to implement the various thresh- 
olding strategies presented here within their hybrid approach. 

For data that are not equispaced , there exist numerous wavelet methods in 
this se t ting. Among them let us cit e Hall a nd Turlachl (119971). lAntoniadis et al.1 

1997 ). ICai and Brownl (ll998l ). lkovac and Silvermanl (l200dh . I Antoniadis and Far 

2001) 



Cai and Brownl (ll998h.lKovac and Silvermanl d200C . 
Maximl (12003^ IChickenl (|2003l ). iKerkvacharian and Picardl (|2004l ) and 



lAmato et al.l (|2006bf h Compared to standard algorithms, the thresholding is 
notably more complicated because it has to incorporate the variations of the 
density of the design. In many of these constructions, the first step consists in 
determining a function Y{x) of the form: 

Y(x) = w m (x)Y m , 



where w m (x) is a seq uence of functions suitably chosen. For instance, in 

iS correspond t o a polynomia l which depends 

103)). the WmS corre- 




sponds to sc ale wavelets warped with the (known) design cumulative distribution 



function. In lAntoniadis et alj (|1997T ). the random design is transformed into eq- 
uispaced data via a binning method and the weights w m are defined by scale 
wavelets. In a second step, the function Y(x) is expanded o n a standard wavelet 
basis and a hard thresholding algorithm is perfo rmed. In Antoniadis and Far] 



(2001) the w m s are Sobolev interpolation weights. Kovac and Silvermanl ( 2*000h 



apply first a linear transformation to the data to map it to a set of dyadic, equis- 
paced points. Since the transformation induces a band-limited correlation on the 
resulting wavelet coefficients, they adopt a term-by-term thresholding method 
similar to Vi suShrink, but taking into account the varying levels of noise. In 
IChickenl (l2003f) linear transformations and block thresholding are applied to 



nondyadic, non-i.i.d., nonequispaced data with various band- limited covariance 
structures. His method makes use of Kovac and Silvermans fast algorithm for 
estimating the covariances of the transformed data. In all the techniques de- 
scribed above, the thresholds have similar forms and depend on the quantity 
sup t j^pj where s is the density of the design points, and which corresponds 
to an upper bound for the varian ce of the estimated wavelet coefficients. The 
algorithm of lAmato et al. ( 2006b ) is quite different and relies upon wavelet ker- 



nel reproducing Hilbert spaces. The regularization method suggested in their 
paper requires neither pre-processing of the data by interpolation or similar 
technique, nor the knowledge of the distribution s of the design points. For this 
reason, the method works really well even in the case when the distribution of 
the design points deviates far from the uniform. When the estimation error is 
calculated at the design points only, the method achieves optimal convergence 
rates in Besov spaces no matter how irregular the design is. In order to ob- 
tain asymptotic optimality in the L2 metric, an extra assumption on the design 
points should be imposed, namely, the density s of the design points should be 
bounded a way from zero. To end this parag raph we mention the algorithm de- 



veloped in IKerkvacharian and Picard ( 2004 ). Their procedure stays very close 
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to the equispaced Donoho and Johnstone's Visushrink procedure, and thus is 
very simple in its form (preliminary estimators are no longer needed) and in its 
implementation (the standard uniform threshold suffices). On the other side, the 
projection is done on an unusual non-orthonormal basis, called warped wavelet 
basis, so their analytic properties need to be studied to derive the performances 
of the estimator. Some theoretical results, including maxiset properties are es- 
tablished in their paper. 

Con cerning the noise, we would like to mention a recent work bv lBrown et al.l 
(2006) who develop a nonparametric regression method that is simultaneously 



adaptive over a wide range of function classes for the regression function and 
robust over a large collection of error distributions, including those that are 
heavytailed, and may not even possess variances or means. Their approach is 
to first use local medians to turn the problem of nonparametric regression with 
unknown noise distribution into a standard Gaussian regression problem and 
then apply a wavelet block thresholding procedure to construct an estimator 
of the regression function. It is shown that the estimator simultaneously at- 
tains the optimal rate of convergence over a wide range of the Besov classes, 
without prior knowledge of the smoothness of the underlying functions or prior 
knowledge of the error distribution. The estimator also automatically adapts to 
the local smoothness of the underlying function, and attains the local adaptive 
minimax rate for estimating functions at a point. A key technical result in their 
development is a quantile coupling theorem which gives a tight bound for the 
quantile coupling between the sample medians and a normal variable. 

We would like to add some comments concerning the choice of the penalty 
parameter A in finite sample situations. From a practical point of view its op- 
timal choice is important. Given the basic framework of function estimation 
using wavelet thresholding and its relation with the regularization approach 
with penalties non smooth at 0, th e re ar e a variety of methods to choose 



the regularization parame ter A. ISolol (|200lf ) in his discussion of the paper by 



Antoniadis and Fan! (|200ll ) suggests a data based estimator of A simila r in spirit 



to the SURE selection criterion used bv lDonoho and Johnstone (1994), and pro- 



vides an appropriate simple SURE formula for the general penalties studied in 
this review. Another way to address the optimal choice of the regularization 
parameter is Generalized Cross Validation (GCV). Cross-validation has been 
widely used as an automatic procedure to choose the smoothing parameter in 
many statistical settings. The classical cross-validation method is performed by 
systematically expelling a data point from the construction of an estimate, pre- 
dicting what the removed value would be and, then, comparing the prediction 
with the val ue of the expelled p oint. One way to proceed is to use the approach 
adopted by Ijansen et al. ( 1997 ). It is clear that this is an area where further 



careful theoretical and practical work is needed. 

3.5. Block thresholding for nonparametric regression 



In the previous subsections we have used separable penalties, which achieve min- 
imax optimality through term-by-term thresholding of the empirical wavelet 
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coefficients by realizing a degree of trade-off between variance and bias con- 
tribution to mean squared error. However this trade-off is not optimal. While 
separable rules have desirable minimax properties, they are necessarily not rate 
adaptive over Besov spaces under integrated mean squared error because they 
remove too many terms from the empirical wavelet expansion, with the result the 
estimator being too biased and having sub-optimal L 2 -risk convergence rate (and 
also in L p , 1 < p < oo). One way to increase estimation precision is by utilizing 
information about neighboring empirical wavelet coefficients. In other words, 
empirical wavelet coefficients could be thresholded in blocks (or groups) rather 
than individually. As a result, the amount of information available from the data 
for estimating the "average" empirical wavelet coefficient within a block, and 
making a decision about retaining or discarding it, would be an order of mag- 
nitude larger than the case of a term-by-term threshold rule. This would allow 
threshold decisions to be made more accurately and permit convergence rates 
to be improved. This is the spirit of block thresholding rules and blockwise ad- 
ditive penalty functions that ha ve been an d are c u rrently theoretically exp l ored 
by many researchers (see e.g. Hall et al.l ( 1999 ). ICai and BrowrJ (fl999h. ICail 



(l2002h.lCai and Silverman! (|200lh . Ychickenl (<2003h and lChicken and Cail (|2005E 



Cai and Zhoul <|2005h ). For completeness we present below some more details on 



such block-wise thresholding procedures. 



3.5.1. A nonoverlapping block thresholding estimator 



A non overlapping block thresholding estimator was proposed bvlc^" and Brownl 



(1999) via the approach of ideal adaptation with the help of an oracle 



At each resolution level j = jo, . . . , J — 1, the empirical wavelet coefficients 
djk are grouped into nonoverlapping blocks of length L. In each case, the first 
few empirical wavelet coefficients might be re-used to fill the last block (which 
is called the augmented case) or the last few remaining empirical wavelet coef- 
ficients might not be used in the inference (which is called the truncated case) , 
should L not divide 1? exactly. 

Let (jb) denote the set of indices of the coefficients in the 6th block at level 
j, that is, 

(jb)={(j,k):(b-l)L + l<k<bL}, 

and let Sh b \ denote the L 2 -energy of the noisy signal in the block (jb). Within 
each block (jb), estimate the wavelet coefficients djk simultaneously via a Jamcs- 
Stein thresholding rule 

d% b) = max ( 0, ""52^" J d 3k- (3.15) 

Then, an estimate of the unkown function g is obtained by applying the IDWT to 
the vector consisting of both empirical scaling coefficients dj k (fc = 0, 1, . . . , 2- 70 — 
1) and thresholded empirical wavelet coefficients d^ (j = j , . . . , J — 1; k — 
0,1,...,V-1). 
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Cai and Brown ( 19991 ) suggested using L = logn and setting A = 4.50524 
which is the solution of the equation A — log A — 3 = 0. This particular threshold 
was chosen so that the corresponding wavelet thresholding estimator is (near) 
optimal in function estimation problems. The resulting block thresholding esti- 
mator was called BlockJS. 



Remark 3.1. \Hall et all \l99j ) and lHall et ail tl99c\ ), \Hall et all Il99. 



sidered wavelet block thresholding estimators by first obtaining a near unbiased 
estimate of the L 2 -energy of the true coefficients whithin a block and then keep- 
ing or killing all the empirical wavelet coefficients within the block based on the 
magnitude of the estimate. Although it would be interesting to numerically com- 
pare their estimators, they require the selection of smoothing parameters - block 
length and thershold level - and it seems that no specific criterion is provided 
for choosing these parameters in finite sample situations. 



3.5.2. An overlapping block thresholding estimator 



Cai and Silverman! (|200l!) considered an overlapping block threshol ding estima - 



tor by modifying the nonoverlapping block thresholding estimator of Cai ( 1999f) 



The effect is that the treatment of empirical wavelet coefficients in the middle 
of each block depends on the data in the whole block. 

At each resolution level j — jo, . . . , J — 1, one groups the empirical wavelet 
coefficients djk into nonoverlapping blocks (jb) of length L$. Extend each block 
by an amount L\ — max (1, [Lo/2]) in each direction to form overlapping larger 
blocks (jB) of length L = L + 2L\. 

Let Shg\ denote the L 2 -energy of the noisy signal in the larger block (jB). 
Within each block (jb), estimate the wavelet coefficients simultaneously via the 
following James-Stein thresholding rule 

d% b) = max 0, d jh . (3.16) 

V b UB) J 

Then, an estimate of the unkown function g is obtained by applying the IDWT to 
the vector consisting of both empirical scaling coefficients c J0/ t (k = 0, 1, . . . , 2- 70 — 

1) and thresholded empirical wavelet coefficients d^ (j = jo, . . . , J — 1; k = 

0, 1 2J-1). 

Cai and Silverman! (|200l! ) suggested using either L — [logn/2] and taking 



A = 4.50524 (which results in the NeighBlock estimator) or Lq = L\ = 1 (i.e., 
L = 3) and taking A = |logn (which results in the NeighCoeff estimator). 
NeighBlock uses neighbouring coefficients outside the block of current interest 
in fixing the threshold, whilst NeighCoeff chooses a threshold for each coefficient 
by reference not only to that coefficient but also to its neighbours. 

Remark 3.2. The above thresholding rule \3. 1 6\) is different to the one given in 
\3.15}) since the empirical wavelet coefficients djk are thresholded with reference 
to the coefficients in the larger block (jB). One can envision (jB) as a sliding 
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window which moves Lq positions each time and, for each window, only half of 
the coefficients in the center of the window are estimated. 

As noticed above, the block size and threshold level play important roles in 
the performance of a block thresholding estimator. The local block threshold- 
ing methods mentioned above all have fixed block size and threshold and the 
same thresholding rule is applied to all resolution levels regardless of the dis- 
tribution of the wavelet coefficients. In a recent paper, Cai and Zhou (2005) 
propose a data-driven approach to empirically select both the block size and 
threshold at individual resolution levels. At each resolution level, their proce- 
dure, named SurcBlock, chooses the block size and threshold empirically by 
minimizing Stein's Unbiased Risk Estimate (SURE). By empirically selecting 
both the block size and threshold and allowing them to vary from resolution 
level to resolution level, the SureBlock estimator has significant advantages over 
the more conventional wavelet thresholding estimators with fixed block sizes. 
For more details the reader is referred to the above paper. 

We would like to end this subsectio n with some rema r ks on the use of block 
thresholding for density estimation bv lChicken and Cail (|2005h . The reader in- 
terested in other wavelet based methods f or density es t imati on from i.i.d. obser- 
vations is referred to the review paper by I Antoniadis! (|l997t ) where some linear 
and nonlinear wavelet estimators in univariate density estimation are discussed 
in detail. Chicken and Cai's block thresholding procedure first divides the em- 
pirical coefficients at each resolution level into nonoverlapping blocks and then 
simultaneously keeps or kills all the coefficients within a block, based on the 
magnitude of the sum of the squared empirical coefficients within that block. 
Motivated by the analysis of block thresholding rules for nonparametric regres- 
sion, the block size is chosen to be logn. It is shown that the block thresholding 
estimator adaptively achieves not only the optimal global rate over Besov spaces, 
but simultaneously attains the adaptive local convergence rate as well. These 
results are obtained in part through the determination of the optimal block 
length. 

4. Some applications 

In this Section we present two recent applications of wavelets in statistics. Each 
application begins with a description of the problem under study and points to 
specific properties and techniques which were used to determine a solution. Of 
course many other excellent applied works on wavelets have been presented in 
the literature that we find it impossible to list them all in this review. We will 
mainly concentrate on some important uses of wavelet methods in statistics, 
developed recently by the author and its collaborators. 

4-1- Wavelet thresholding in partial linear models 

This subsection is concerned with a semiparametric partially linear regression 
model with unknown regression coefficients, an unknown nonparametric func- 
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tion for the non-linear component, and unobservable Gaussian distributed ran- 
dom errors. We present a wavelet thresholding based estimation procedure to 
estimate the components of the partial linear model by establishing the connec- 
tion made between an Li-penalty based wavelet estimator of the nonparametric 
component and Huber 's M-estimation of a standard linear model with outliers. 

Assume that responses y% , . . . , y n are observed at deterministic equidistant 
points ti — — of an univariate variable such as time and for fixed values Xj, 
i = 1, . . . , n, of some p-dimensional explanatory variable and that the relation 
between the response and predictor values is modeled by a Partially Linear 
Model (PLM): 

yi = X.?p + f(ti)+Ui i = l...n, (4.1) 

where /3 is an unknown p-dimensional real parameter vector and / is an un- 
known real- valued function; the Mi's are i.i.d. normal errors with mean and 
variance a 2 and superscript "T" denotes the transpose of a vector or matrix. We 
will assume hereafter that the sample size is n = 2 J for some positive integer 
J. Given the observed data (j/i,Xj)j = i ...„, the aim is to estimate from the data 
the vector (3 and the function /. 

Partially linear models are more flexible than the standard linear model be- 
cause they combine both parametric and nonparametric components when it is 
believed that the response variable Y depends on variable X in a linear way 
but is nonlinearly related to other independent variable t. As it is well known, 
model 14.11 is often used when the researcher knows more about the dependence 
among y and X than about the relationship between y and the predictor t, 
which establishes an unevenness in prior knowledge. Several methods have been 
proposed in the literature to consider partially linear models. One approach 
to estimation of the nonparametric component in these models is based on 
sm oothing splines reg r ession techniques a n d has bee n emp l oyed in particular 



sm ootmng splines regression teenmq ues a n a has bee n emp l oyed m particuia: 
by iGreen and YandelJ (| 19851) . lEngle et all d 19861) and iRicei (| 19861) among oth- 



ers. Kernel regressi on (see e.g. |Sp eckman (119881) ') and local polynomial fitting 
techniques (see e.g. lHamilton and Truona (|l997l )) have also been used to study 



partially linear models. An important assumption by all these methods for the 
unknown nonparametric component / is its high smoothness. But in reality such 
a strong assumption may not be satisfied. To deal with cases of a less-smooth 
nonparam etric componen t, a wavelet based estimation procedure, developed re- 
cently by Gannad ( 20061 ) (a PhD student of the author), is developed in what 



follows, and as such that it can handle nonparametric estimation for curves ly- 
ing in Besov spaces instead of the more classical Sobolev spaces. To capture key 
characteristics of variations in / and to exploit its sparse wavelet coefficients 
representation, we assume that / belongs to B* r ([Q; 1]) with s + l/n— 1/2 > 0. 
The last condition ensures in particular that evaluation of / at a given point 
makes sense. 
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Jj.A.1. Estimation procedure 

In matrix notation, the PLM model specified by (|4.1[) can be written as 

Y = Xf3 Q + F + U, (4.2) 

where Y = (Y[, . . . , Y n ^j T , X T = (Xi, . . . , X n ) is the p x n design matrix, 

and F = (/(ii), . . . f(t n )) T . The noise vector U = (ui, . . . , u„) T is a Gaussian 
vector with mean and variance matrix a 2 I n . 

Let now Z = WY, A = WX, 9 = WF and e = WU, where W is the 
discrete wavelet transform operator. Pre-multiplying (|4.1|) by W, we obtain the 
transformed model 

Z = A(3 + O + e. (4.3) 

The orthogonality of the DWT matrix W ensures that the transformed noise 
vector e is still distributed as a Gaussian white noise with variance a 2 I n . Hence, 
the representation of the model in the wavelet domain not only allows to retain 
the partly linear structure of the model but also to exploit in an efficient way the 
sparsity of the wavelet coefficients in the representation of the nonparametric 
component. 

With the basic understanding of wavelet based penalized least squares pro- 
cedures covered in depth in Section [3j we propose estimating the parameters 
(3 and 6 in model (|4. 3[) by penalized least squares. To be specific, our wavelet 
based estimators are defined as follows: 

(- n "1 

((3 n ,8 n ) = argmin{ J n (/3, 6) = V -(z 4 - Af(3 - 4 ) 2 + A V |0 4 | , (4.4) 
09,0) { ti 2 tt ) 

for a given penalty parameter A, where iq = 2 J0 + 1. The penalty term in 
the above expression penalizes only the empirical wavelet coefficients of the 
nonparametric part of the model and not its scaling coefficients. Remember 
that the L 1 -penalty is associated the soft thresholding rule. 

The regularizatio n method propo s ed ab ove is closely related to the method 
proposed recently by[Ch ang and Qui (|2004h . who essentially concentrate on the 



backfitting algorithms involved in the optimization, without any theoretical 
study of t he resulting estimates . The method also relates to the recent one de- 
veloped bv lFadili and Bullmord (|2005h where a variety of penalties is discussed. 



Note, however, that their asymptotic study is limited to quadratic penalties 
which amounts essentially in assuming that the underlying function / belongs 
to some Sobolev space. 

Let us now have a closer look at the minimization of the criterion J n stated 
in (|4.4[) . For a fixed value of /3, the criterion J n {(3, ■) is minimum at 

2 ( ^_j z i- A JP if*<»o, u , 

Ul{l3} Ugn( Zl -AjP)(\ Zl ~Aj(3\-\) + ifz> i0 . l4 '" j 
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Therefore, finding /3 n , a solution to problem ()4.4|1 . amounts in finding (3 n min- 
imizing the criterion J n ({3, 0({3)). However, note that 

n 

J„(/3,0(/3)) = 5> A (z 2 -Af/3) (4.6) 

i=i 

where p\ is Huber's cost functional with threshold A, defined by: 



u 2 /2 if \u\ < A, 

A|u|-A 2 /2 if |u| > A. 



pa(«) =<:,', , 9/ „ i if? (^) 



The above facts can be derived as follows. Let i > ig. Minimizing expression 
(|4.4p with respect to Oi is equivalent in minimizing j(8i) :— h(zi — Af/3 — Oi) 2 + 
\\6i\. The first order conditions for this are : j'(0i) — 0i~ [z%— Af /3)+sign(6'i)A = 

where j' denotes the derivative of j. Now, 

• if 6i > 0, then j'(0i) = if and only if t = Zi - Af/3 - A. Hence, if 
zi — Af/3 < A, 6i = and otherwise 0j = — — A. 

• if 6{ < 0, j'(6*i) is zero if and only iffli = Z{ — Af (3 + A; therefore, if 
Zi - Aff3 > -A, 8i = and otherwise Oi = Zi - Af f3 + A. 

This proves that for a fixed value of /3, the criterion (|4.4[) is minimal for 9((3) 
given by expression (|4.5p . If we now replace in the objective function J n we 

n 

obtain J„(/3, 0(/3)) = |£ ((% - - ^) 2 + A|^|) since k = Zi - Af(3 for 
i=io 

1 < iq. Now denoting by / the set / := {j = iq . . . n, \zj — Aj(3\ < A}, we 

find that J n (f3, 6{p)) = - Af/3) 2 + i^A 2 + A^ (\z z - Af(3\ - A) by 

I /c /c 

replacing Oi with (|4.5[) . which is exactly Huber's functional. 

This result allows the computation of the estimators (3 n et 6 n in a non- 
iterative fashion. The result ing form of the estimators allows us to study their 
asymptotic properties (see Gannazl ( 20061 )). Another benefit is that one can 



design estimation algorithms much faster than those based on backfitting. 

The estimation procedure may be summarized as follows. Using the observed 
data (Y,X) : 

1. Apply the DWT of order J = log 2 (n) on X and Y to get their corre- 
sponding representation A and Z in the wavelet domain. 

2. The parameter /3 is then Huber's robust estimator which is obtained 
without taking care of the nonparametric component in the PLM model: 

n 

/3„ = argminy2p x (z l - Af/3). 
(3 t^i 

In other words this amounts in considering the linear model z% = Af f3 +ei 
with noise ej = Oq j + ej . 
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3. The vector 6 of wavelet coefficients of the function / is estimated by soft 
thresholding of Z — Af3 n : 



The estimation of / is then obtained by applying the inverse discrete 
wavelet transform. Note that this last step corresponds to a standard 
soft-thresholding nonparametric estimation of / in the model: 



Remark 4.1. The wavelet soft-thresholding procedure proposed above is derived 
by exploiting the connection between an L\-based penalization of the wavelet 
coefficients of f and Ruber's M-estimators in a linear model. Other penalties, 
leading to different thresholding procedures can also be seeing as M-estimation 
procedures. For example, if 5\ denotes the resulting thresholding function, we 
can show in a similar way that the estimators verify 



with p\ being the primitive of u i— > u — 5\{u). From the above, one sees that 
hard thresholding corresponds to mean truncation, while SCAD thresholding is 
associated to Hampel's M-estimation. However, in this subsection, we only con- 
centrate on the properties of estimators obtained by soft thresholding. 

Existing results for semi-parametric partial linear models establish paramet- 
ric rates of convergence for the linear part and minimax rates for the nonpara- 
metric part, showing in particular that the the existence of a linear component 
does not changes the rates of convergence of the nonparametric component. 
Within the framework adopted in this paper, the rates of convergence are simi- 
lar, but an extra logarithmic term appears in the rates of the parametric part, 
mainly due to the fact that the smoothness assumptions on the nonparametrric 
part are weaker. Indeed, under appropriate assumptions, one has: 

Propostion 4.1. Let f3 n and n be the estimators defined above. Consider that 
the penalty parameter A is the universal threshold: A = o-y/2\og(n). Then it 
holds 




where Vi = 



Y i -X?0 n = f(U)+v i , t=l...n, 
XfOSo-AO+Ui. 



n 




i = 1 . . . n, 
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If in addition we assume that the scaling function <j) and the mother wavelet if) 
belong to C R and that ip has N vanishing moments, then, for f belonging to the 
Besov space B% r with < s < min(i?, N), one has 



||/n-/||2 = J 



'log(n) 



™here\\f n -f\\l = ti{f n -fY. 

A pro of of thi s prop osition together with the relevant assumptions may be 
found in Gannaz (|2006h . The presence of a logarithmic los s lies on the choice of 



the th reshold A: taking A which tends to 0, as suggested bv lFadili and Bullmord 



would lead to a minimax rate in the estimation of /3. The drawback is 



that the quality of the estimation for the nonparametric part of the PLM would 
not be anymore quasi- minimax. This phenomenon was put in evidence by iRicei 
(1986): a compromise must be done between the optimality of the linear part 



estimation with an oversmoothing of the functional estimation and a loss in 
the linear regression parameter convergence rate but a correct smoothing of the 
functional part. 



4-1-2. Estimation of the variance 

The estimation procedure relies upon knowledge of the variance a 2 of the noise, 
appearing in the expression of the threshold A. In practice, this variance is 
unknown and needs to be estimated. In wavelet approaches for standard non- 
parametric regression, a popular and well behaved estimator for the unknown 
standard deviation of the noise is the median absolute deviation (MAD) of the 
finest detail coefficients of the response divided by 0.6745 (see iDonoho et al. 
(1995)). The use of the MAD makes sense provided that the wavelet repre- 
sentation of the signal to be denoised is sparse. However, such an estimation 
procedure cannot be applied without some pretreatment of the data in a par- 
tially linear model because the wavelet representation of the linear part of a 
PLM may be not sparse. 

A QR decomposition on the regression matrix of the PLM allow to eliminate 
this bias. Since often the function wavelet coefficients at weak resolutions are 
not sparse, we only consider the wavelet representation at a level J = log 2 (n). 
Let Aj be the wavelet representation of the design matrix X at level J. The QR 
decomposition ensures that there exist an orthogonal matrix Q and an upper 
triangular matrix R such that 

^ = «(*)■ 

If Zj, 6qj and ej denote respectively the vector of the wavelets coefficients 
at resolution J of Y, F and U, model (|4.3p gives 

Q T zj = (*) [3 + Q T 9 ,j + Q T ej. 
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It is easy to see that applying the MAD estimation on the last components of 
Q T zj rather than on zj leads to a satisfactory estimation of a. Indeed thanks 
to the QR decomposition the linear part does not appear anymore in the es- 
timation and thus th e framework is similar to the one used in nonparametric 
regression. Following Donoho and Johnstone ( 19981 ). the sparsity of the func- 
tional part representation ensures good properties of the resulting estimator. 
The interested reader will find in Gannaz two particular optimization 

algorithms that may be used for estimating in a computationally efficient way 
the linear part of the model. 



4-2. Dimension reduction in functional regression 



In functional regression problems, one has a response variable Y to be predicted 
by a set of variables X\ , . . . , X p that are discretizations of a same curve X at 
points tx, ■ ■ ■ , t p , that is Xj = X(tj), j = 1, . . . ,p where the discretization time 
points tj lie in [0, 1] without loss of generality. A typical set-up includes near 
infra-red spectroscopy (Y represents the proportion of a chemical constituent 
and x is the spectrum of a sample, discretized at a sequence of wavelengths). 

In practice, before applying any nonparametric regression technique to model 
real data, and in order to avoid the curse of dimensionality, a dimension reduc- 
tion or model selection technique is crucial for appropriate smoothing. A possible 
approach is to find a functional basis, decompose the covariate curve X{t) ac- 
cordingly and wor k with the coefficie nts in a s pirit s imila r to that adopted by 
Martens and Naed (|l989h . chapter 3, lAlsberd (|l993l ) and iDenham and Brownl 
( 1993f h The aim is to explain or predict the response through an expansion 
of the explanatory process in a relatively low dimensional basis in the space 
spanned by the measurements of X , thus revealing relationships that may not 
be otherwise apparent. Dimension reduction without loss of information is a 
dominant theme in such cases: one tries to reduce the dimension of X without 
losing information on and with out re q uiring a model for Y\X. Borrowing 

terminology from classical statistics, ICookl ( 20001 ) calls this a sufficient dimen- 
sion reduction which leads to the pursuit of sufficient summaries containing all 
of the information on Y\X that is available from the sample. 

In this subsection we describe a wavelet based regression approach for re- 
gression problems with high dimensio nal X variables , developed recently by the 



author and its co-authors (see Amato et al. ( 2006ah ). The methodology relies 



upon the above notion of suf ficient dim e nsion reduction and is based on devel- 
opments in a recent paper bv lXia et all (l2002t ) where an adaptive approach for 
effective dimension reduction called the (conditional) minimum average variance 
estimation (MAVE) method, is proposed within quite a general setting. 



4.2.1. Wavelet based MAVE 



We describe below the application of the MAVE method via a wavelet decom- 
position of the explanatory covariate process. We suppose that each realization 
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X(t) of the covariate process will be modelled as X(t) = f(t) + s(t), t e [0, 1], 
where f(t) represents the mean at time t and s(t) is the observed residual varia- 
tion, which will be regarded as a realization of a second order weakly stationary 
process. Since a large number of signal compression algorithms are based on 
second order statistical information we will concentrate on covariance mod- 
elling, and the mean function will be removed by filtering or simple averaging. 
Thus, we assume hereafter that the covariate process has been centered, so that 
E(X{t)) = for all t. 

Consider then the following model 

Y = g((p 1 ,X),...,(f3 K ,X))+e (4.8) 

where e is a scalar random variable independent of X(t), {(3 s (t),s = 1, . . . , K} 
are K orthonormal functions belonging to L 2 ([0, 1]), and g is a smooth link 
function of R K into R. For D = K, we obtain a standard regression model with 
all explanatory variables (/3 s ,X),s = 1,,..,K entering independently. Provided 
that D < K, the regression function depends on X only through D linear func- 
tional of the explanatory process X. Hence, to explain the dependent variable 
Y, the space of K explanatory variables can be reduced to a space with a smaller 
dimension D. The dimension reduction methods aim to find the dimension D 
of the reduction space and a basis defining this space. 

Given a multiresolution analysis of L 2 ([0, 1]) and a primary level jo, as seen 
in Section [21 both X(t) and j3 s {t) can be decomposed as 

2 J '0-1 2 J -1 

x (t) = Y &o,k<£jo,* + Y Y Vjt^k 

fe=0 j>j k=0 

2 J 'o-l 2 3 -l 

fc=0 j>jo k=0 

with 



£?o,fc = ( X i<t>jo,k) and r) jik = (X, V>i,fc) 
c j ,k = {PsAh,k) and d* jtk = {/3 s ,i>j,k), 

s = 1, . . . , K and {£j ,fc, Vj,k}j,k sequences of random variables. By the isometry 
between L 2 ([0, 1]) and ^ 2 (1R), model l|4.8p can be also written as 

Y = g((f3 1 , 1 ),...,((3 Kl7 ))+e. (4.9) 

We have indicated by (3 S the £ 2 -sequence formed by the wavelet and scaling 
coefficients of @ s {t), s — 1,...,K; and by 7 the £ 2 -sequence formed by the 
wavelet and scaling coefficients of X(t). 

Usually, the sample paths of the process X are discretized. If we observe 
p = 2 J values of X(t), (Xx, ■ ■ ■ , X p ) — (X(t{), . . . , X(t p )), then, given the pre- 
vious notation, X(t) can be approximated by its 'empirical' projection onto the 
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approximation space Vj defined as 

2 3 °-l ,7-1 2 j -l 

k=0 3=30 k=0 

where £j ,k and %,fe are the empirical scaling and wavelet coefficients. We will 
collect them into a vector 7 J £ R p . Let /?/ be the projection of (3 S onto Vj and 
let us denote by (3 J S S K p the vector collecting its scaling and wavelet coeffi- 
cients, s = 1, . . . , K. The original model (|4.9jl is then replaced by its discrete 
counterpart 

^ = .9((/3i,7 J ),---,(/3^7 J >) +e- (4.10) 

As much as K and J are large enough and thanks to the isometries between 
Li and £2 and the compression properties of the wavelet transform, the original 
functional regression model (14. 8j) may be replaced by the above model (|4.10|) 
which is the candidate for further dimension reduction by MAVE. A compact 
way to write down model (14.10| is 

Y = 9 (B T j J ) + e, (4.11) 

B eR pxK , v = 2 J . The method is then applied to Eq. (|4~TT|) . For the sake of 
completeness, we briefly describe hereafter the MAVE method, as it is applied 
on data obeying model (|4.1ip . 

Let d represent now the working dimension, 1 < d < K. For an assumed 
number d of directions in model (|4.10[) and known directions J5o, one would 
typically minimize 

minE{y - E(Y\B%j J )} 2 , 

to obtain a nonparametric estimate of the regression function ¥,(Y\BqJ J ). The 
MAVE method is based on the local linear regression, which hinges at a point 
7q on linear approximation 

E(V|5 T r 7 ) « a + b T B^f' - 7o)- (4-12) 

Now, if di r ection s Bq are not known, we have to search their approximation B. 
Xia et all (|2002l ) propose to plug-in unknown directions B in the local linear 



approximation of the regression function and to optimize simultaneously with 
respect to B and local parameters a and b of local linear smoothing. Hence, 
given a sample (7/, Vi)"=i from ( x f J , Y), they perform local linear regression at 
every 7g =7/, i — 1, . . . , n, and end up minimizing 



min EEC'" [at + b\B T (7/ - #)] }' W (4-13) 



\ , 7 ;=i i=i 

where Ik represents the K x K identity matrix and wu are weights describing 
the local character of the linear approximation (|4.12[) (i.e., Wu should depend 
on the distance of points 7/ and y t ). 
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Xia et al.l (|2002n call the resulting estimator of B, MAVE and show that the 
simultaneous minimisation with respect to local linear approximation given by 
a,j, bj and to directions B results in a convergence rate superior to any other 
dimension-reduction method. Initially, a natural choice of weights is given by a 
multidimensional kernel function Kh- At a given 7q , 



w i0 = K h {^ 



7o J )/E^(^ 



7o J ), 



for i — l,...,n and a kernel function Kh{-), where h refers to a bandwidth. 
Additionally, when we already have an initial estimate of the dimension reduc- 
tion space given by B, i t is possibl e to it erate and search an improved estimate 
of the reduction space. IXia et all (|2002h do so by using the initial estimator 
B to measure distances between points 7/ and 7q in the reduced space. More 
precisely, they propose to choose in the iterative step weights 



w l0 = K h {B T ^{ - 7o J )V E K ^B T {li 7o 7 ))- 

i=l 

Repeating such iteration steps until convergence results in a refined MAVE 
(rMAVE) estimator. As one sees from the above equations, the initial estimate 
B depends on local linear smoothing performed with we ights computed v ia a 
multidimensional kernel on W. Since, by Theorem 1 in IXia et all (|2002l) the 
optimal kernel bandwidth h must be such that h = 0(\ogn/n 1 / p ), in order to 
avoid the curse of dimensionality and to stabilize the computations it is therefore 
advisable in practice to reduce the initial resolution log 2 p to some resolution 
J < log 2 p, and in such a way that the approximation of 7 by its projection 7^ 
does not affect the asymptotics. When assumi ng that the process X is a-regular 



Amato et al.l (|2006al) ). From now 



such a condition holds if 2 a J = 0{n 1 l p ) (see 
on, whenever we refer to MAVE, we mean its refined version rMAVE with 
such a choice of smoothing parameters. One may show that under appropriate 
assumptions on X and g and with such a choice of smoothing parameters, one 
gets optimal asympt otic rates. The interested reader is referred to the paper of 
Amato et al.l (|2006al) for more details. 

The described methods are capable of estimating the dimension reduction 
space provided we can specify its dimension. To deter mine the dimension d , 
Xia et all (|2002l ) extend the cross-validation approach of lYao and Tond (| 19941 ) . 
The cross-validation criterion is defined as 



CV(d) = 

3=1 



E 



Y t K h (B T (jt-^) 



i= tr^-Er= M ^^(5 T (7/-7 

for d > and for the special case of independent Y and X as 

1 " 

CV(0) = -Y / (Y l -Yf. 
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Consequently, the dimension is then determined as 

d = argminCV{d) 1 

0<d<K 

where K represents the initial number of basis functions in model ((47 



4-2.2. Numerical experiment 

To illustrate the performance of the dimensional reduction regression method 
proposed in this subsection we report he re some of the results of an extensive 
Monte Carlo simulation study realized in lAmato et al for a particular 

model. More precisely let H = L 2 ([0, 1]), X = (^(^))te[o.i] be a standard 
Brownian motion and e a mean zero Gaussian distribution with variance a 2 
independent of X. All curves 0%{t) and X(t) arc discretized on the same grid 
generated on an equispaced grid of p = 128 equispaced points t € [0, 1] . The 
observations Y are generated from i.i.d. observations of X and e according to 
the following model: 

Model 1. 

Y - cxp ((ft, X)) + exp (| (p 2 , X) |) + e, 

0i(t) = sin(37rf/2), (i 2 (t) = sin(57rf/2), cr = 0.1 

The motivation for this example is that the functions (3\ and j3 2 belong to 
the eigen-subspace of the covariance operator of X and therefore it represents 
the ideal situation where the EDR space is included in the central subspace. 

On several simulation runs, the number of directions chosen by cross-validation 
MAVE was 2 which is optimal since the functions f3\ and (3 2 are respectively 
the second and third eigenvalues of the covariance operator of X. Whatever the 
estimation of the directions i = 1, 2 are, the quality of prediction is related to 
how close the estimated projections 0i,X) are to the true projections (0i,X). 
In order to check this, Figure [6] displays the indexes ($i,X) versus {(3i,X) for a 
typical simulation run, showing a quite satisfactory estimation. 



5. Conclusion 



This survey paper has investigated several aspects of wavelet thresholding and 
has considered two recent applications of wavelet to solve some interesting statis- 
tical problems. We would like however to mention here few more types of signal 
processing problems where these methods have been used in practice. Wavelet 
analysis and denoising have been found particularly useful in detecting machin- 
ery fault detection. Typical examples of signals encountered in this field are 
vibration signa ls generated in de fective bearings and gears rotating at constant 
speeds (see e.g. lLada et al. l (l2002l) V Wavelet denoising procedures in conjunction 
with hypothesis testing have also been used for detecting change points in several 
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Fig 6. True projections versus 
wavelet MAVE method (WM). 



?2 2 4 WM ?2 2 

estimated projections for the simulated example with the 



biomedical applications. Typical examples are the detection of life-threatening 
cardiac arythmias in electrocardiographic signals (ECG) recorded during the 
monitoring of patiens, or the detection of venous air embolism in doppler heart 
sound signals recorded during surgery when the incision wounds lie above the 
heart. The same is true for functional analysis of varianc e problems and for 
nonpa r ametric mixed-effects models used in proteomics (e.g. lMorris and Carroll 
(2006). [Antoniadis and Sanatinasl (|2007h ). A number of interesting applications 
of wavelets may be also fo und in economic and financial applications (e.g. 
Ra msav and Lampartl (Il998hl and for times series analys is and prediction (e.g 



Frvzlewicz et al.l (|2003h . lAntoniadis and Sapatinad (|2003l )). In conclusion, it is 



apparent that wavelets are particularly well adapted to the statistical analysis 
of several types of data. 
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